Datasets used

The dataset combines data for the state of California from three main sources:

  1. Economic and social characteristics of the population for 2015 were taken from the 2011-2015 American Community Survey 5-Year Estimates by the United States Census Bureau.
  2. Broadband availability from the Federal Communications Commission.
  3. 2016 presidential election results from the Statewide Database at U.C. Berkeley Law.

The data is available per census tract. Census tracts are subdivisions of a county. California has 8,057 census tracts and 58 counties. Each census tract has its own identification code of 11 digits. The first two digits represent the state code (06 for California), the next three digits give the county code and the last six digits give the tract code. (source)
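The structure of the tract id can be sketched in base R as follows. This is a minimal illustration, not the code used for the analysis: the helper name split_tract_id and the two example ids are assumptions made for this sketch.

```r
# Split an 11-digit census tract id into its parts.
# split_tract_id is a hypothetical helper; the ids below are illustrative only.
split_tract_id <- function(id) {
  id <- as.character(id)             # ids must be text, otherwise the leading 0 is lost
  stopifnot(all(nchar(id) == 11))
  data.frame(
    Id     = id,
    state  = substr(id, 1, 2),       # 06 for California
    county = substr(id, 1, 5),       # state + county code, in format 06###
    tract  = substr(id, 6, 11),      # six-digit tract code
    stringsAsFactors = FALSE
  )
}

parts <- split_tract_id(c("06037980001", "06075010100"))
print(parts)
```

Reading the id as a character column matters: stored as a number, the leading zero of the state code would be dropped and the ids would no longer have 11 digits.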

In addition, the tigris library is used to find spatial data in order to plot census tracts, counties and urban/rural areas on maps, and the maps library is used for population numbers per city.

Census data

The data was available by census tract id (Id). The data did not come in one CSV file; different files had to be loaded and merged on the tract id. Only the variables that were needed were kept; variables were renamed and their types were changed where needed for easier merging and plotting.
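A merge of several tables on the tract id could look like the sketch below. The three small data frames are made-up stand-ins for the separate census files, and their values are invented for illustration.

```r
# Toy stand-ins for the separate census files (values are made up).
pop <- data.frame(Id = c("06001400100", "06001400200"),
                  population = c(3000, 2000), stringsAsFactors = FALSE)
income <- data.frame(Id = c("06001400100", "06001400200"),
                     median_income = c(90000, 75000), stringsAsFactors = FALSE)
hh <- data.frame(Id = c("06001400100", "06001400200"),
                 total_hh = c(1200, 800), stringsAsFactors = FALSE)

# Merge all tables on the tract id, one after the other.
# all = TRUE keeps tracts that are missing from one of the files (as NA).
census <- Reduce(function(x, y) merge(x, y, by = "Id", all = TRUE),
                 list(pop, income, hh))
print(census)
```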

Census variables:

  • Id (tract id)
  • population
  • number of households (total_hh, owner_hh, renter_hh)
  • household size (total_hhsize, renter_hhsize, owner_hhsize)
  • median household income (median_income)
  • participation rate
  • unemployment rate
  • poverty rate

The census data was combined with data that mapped the county ids to the county names (source). Some wrangling was needed; variables were combined, revalued and renamed.

Added variables:

  • county (county id, in format 06###)
  • county_name

Data on the area of each census tract was also added (source). The area was given in square meters, so it was converted to square miles.

Added variable:

  • square_miles

After merging and joining the data, the population per square mile was added.

Added variable:

  • pop_sqmiles (population per square mile)
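The conversion and the density calculation can be sketched as follows. The column name area_sqm and the example values are assumptions for this illustration; the conversion factor follows from one mile being exactly 1,609.344 meters.

```r
# One square mile in square meters (1 mile = 1609.344 m exactly).
SQM_PER_SQMILE <- 1609.344^2

# Toy tracts; area_sqm is a hypothetical column name for the area in square meters.
tracts <- data.frame(
  Id         = c("A", "B"),
  population = c(4000, 1200),
  area_sqm   = c(2589988.110336, 51799762.20672)  # 1 and 20 square miles
)

tracts$square_miles <- tracts$area_sqm / SQM_PER_SQMILE
tracts$pop_sqmiles  <- tracts$population / tracts$square_miles
print(tracts)
```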

Broadband data

I read an article stating that broadband internet is not available in all areas of California. I therefore decided to add data on broadband availability and see if there were any relations with the census data.

The FCC provides data on availability by census block (a block is part of a census tract). The data was therefore grouped and summarized by tract. As the data was skewed, the median per tract was calculated instead of the mean. Only data for tracts where providers can or do offer consumer service was included. Some further tidying was done to make merging possible. (source)

Variables:

  • Id (tract id)
  • median_down (median maximum advertised downstream speed/bandwidth)
  • median_up (median maximum advertised upstream speed/bandwidth)
  • nr_providers
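Summarizing block-level data by tract can be sketched like this. A census block id has 15 digits, of which the first 11 are the tract id. The column names block_id and max_down and the speeds are assumptions for this illustration.

```r
# Toy block-level records; block_id and max_down are hypothetical column names.
blocks <- data.frame(
  block_id = c("060014001001000", "060014001001001", "060014001001002",
               "060014002001000", "060014002001001"),
  max_down = c(10, 15, 100, 25, 30),   # advertised download speed in Mbps
  stringsAsFactors = FALSE
)

# The first 11 digits of a block id identify the tract.
blocks$Id <- substr(blocks$block_id, 1, 11)

# The median is less sensitive to the single 100 Mbps block than the mean.
median_down <- aggregate(max_down ~ Id, data = blocks, FUN = median)
names(median_down)[2] <- "median_down"
print(median_down)
```

In the first toy tract the mean would be about 41.7 Mbps because of one fast block, while the median stays at 15 Mbps, which is why the median suits skewed data.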

Voting data

It seemed interesting to see how the census and broadband data interact with the voting results of the last presidential election. The voting data was provided by precinct instead of by census block or tract. The voting results therefore had to be mapped from precincts to census blocks and then summarized by census tract. Note that precincts can cover multiple blocks and parts of blocks.

One dataset contained the voting results per precinct. Another contained the mapping from precincts to census blocks, as well as the share of the precinct's total registered votes that was registered in each block. These tables were combined to find the total number of votes for a particular party per census tract. A categorical variable, winner, was added to indicate the winner per tract.

Voting variables:

  • Id (tract id)
  • total_vote (total number of votes cast)
  • dem (percentage of votes for Democratic Party)
  • rep (percentage of votes for Republican Party)
  • winner (Democrats, Republicans, draw)
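The mapping from precincts to tracts can be sketched as below. This is a simplified illustration under stated assumptions, not the original method: the column names (precinct, share, dem_votes, rep_votes) and all numbers are invented, and vote counts are kept instead of percentages.

```r
# Toy precinct results (made-up numbers, hypothetical column names).
votes <- data.frame(precinct = c("P1", "P2"),
                    dem_votes = c(600, 100),
                    rep_votes = c(400, 300),
                    stringsAsFactors = FALSE)

# share = part of the precinct's registered votes that falls in each block.
map <- data.frame(
  precinct = c("P1", "P1", "P2"),
  block_id = c("060014001001000", "060014002001000", "060014002001001"),
  share    = c(0.25, 0.75, 1.0),
  stringsAsFactors = FALSE
)

# Allocate precinct votes to blocks in proportion to the share.
alloc <- merge(map, votes, by = "precinct")
alloc$dem_votes <- alloc$dem_votes * alloc$share
alloc$rep_votes <- alloc$rep_votes * alloc$share

# Sum block-level votes per tract (first 11 digits of the block id).
alloc$Id <- substr(alloc$block_id, 1, 11)
tract <- aggregate(cbind(dem_votes, rep_votes) ~ Id, data = alloc, FUN = sum)

# Categorical winner per tract.
tract$winner <- ifelse(tract$dem_votes > tract$rep_votes, "Democrats",
                ifelse(tract$dem_votes < tract$rep_votes, "Republicans", "draw"))
print(tract)
```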

Note:

  • Other parties also participated in the presidential election, but as they received a small percentage of the votes it was decided not to include them.
  • I have chosen to correct the dataset for voting data which were incorrectly mapped to a water-only tract. I have notified the creators of the dataset.

Univariate Analysis

The three different datasets were joined, which left us with a dataset of 8,057 observations for 23 variables. California has 8,057 census tracts.

## [1] 8057   23

As the social and economic variables come from a census, it can be expected that not all variables are available for each census tract. I will explore what data is missing when looking at the individual variables.

Counties

California has 58 counties. Not every county has the same number of tracts; it varies widely from only 1 (Alpine & Sierra) to 2,346 tracts (LA). The median is 42 tracts and the mean 138.9.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    11.0    42.0   138.9   129.2  2346.0

Census data

Population

The population of the tracts also varies widely, from 0 to 39,454 people. The mean and median are 4,769 and 4,528 respectively.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3417    4528    4769    5832   39454

As the distribution is long-tailed, I zoomed in on the data so that it only shows 99% of the values. Most tracts have a population between 3,000 and 6,000 people. The optimum size of a tract is 4,000 people. 45 tracts have a population of zero.

## [1] 45
Zero population

Looking into this further, I found that California has 21 tracts that are water-only. These tracts can be identified by the fact that their ids are in the range of the 9900s. In addition, tracts that have an id in the range of the 9800s are special land-use census tracts, such as large parks or employment areas with little or no residential population (source). I added a variable to mark these special tracts.

Variable added:

  • special (values: water, special_land or regular)
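A possible way to derive this variable from the tract id is sketched below. The helper name classify_tract and the example ids are assumptions for this illustration. It relies on the six-digit tract code starting with 99 for the 9900s and with 98 for the 9800s.

```r
# Hypothetical helper: mark a tract as water, special_land or regular
# based on the six-digit tract code (digits 6 to 11 of the id).
classify_tract <- function(id) {
  tract_code <- substr(as.character(id), 6, 11)
  prefix <- substr(tract_code, 1, 2)
  ifelse(prefix == "99", "water",
         ifelse(prefix == "98", "special_land", "regular"))
}

# Illustrative ids only.
special <- classify_tract(c("06037990100", "06037980001", "06037101110"))
print(special)
```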

As can be seen above, most of the tracts that have zero population are indeed either water-only or special land-use tracts. However, three tracts, two in LA and one in San Diego, have no special indication. I looked up the tracts on this website, where you can see the tract on a map. The tract in San Diego is water, one of the LA tracts is the area where Universal Studios is located and the other tract seems to be a business park. In all three cases it seems the tract code should be in either the 9900s or the 9800s.

I used the tigris library to find all counties and tracts in California and their latitudes and longitudes. Please note that information is only available for 8,043 tracts; 14 tracts are water-only and are not included in the tigris library. I assume they lie outside the state land boundaries. In order to plot the information I used this tutorial.

As can be clearly seen, Lake Tahoe is made up of water-only tracts. The northern Channel Islands, which are a national park, are clearly marked as special land-use tracts. Most special tracts lie in Los Angeles County. Not all tracts are visible, either because they are small or because they lie beyond the state land borders (for water-only tracts).

I decided to remove these 45 tracts due to their zero population, which results in a dataset of 8012 tracts (+ 24 variables).

## [1] 8012   24

Square miles

The size of the tracts is very diverse. The smallest tract represents a 2 by 2 block area in San Francisco, while the largest tracts are close to 6,950 square miles (one tract contains Death Valley NP and the other Mojave National Preserve). 75% of the tracts have a size of less than 1.787 square miles (3rd quartile).

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    0.022    0.396    0.729   19.405    1.787 6951.837

Even when the axis is transformed (log10), the distribution still has a long tail. 86% of the tracts are smaller than 5 square miles. Smaller tracts are more common than large tracts.

## [1] 85.84623

Population per square mile

As the distributions for both population and square miles are very diverse, it is unsurprising that the population per square mile is also very diverse. It ranges from 0.37 to 173,337 people per square mile, with a median of 6,316 and a mean of 8,607 people. A number like 173,337 people per square mile seems very high, but can be explained by small, densely populated tracts (a couple of blocks) in cities.

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      0.37   2694.90   6316.00   8607.28  11020.46 173336.58

The distribution is a mirror of the square miles distribution. Most tracts have a population between 2,500 and 10,000 people per square mile.

Households

Households can be split into two groups: households that own their house and households that rent. The average home ownership for all tracts is 54.2%; the median is 56.9%. Data is not available for 28 tracts.

Variable added:

  • owner (percentage of households that own their house)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   36.46   56.86   54.17   73.65  100.00      28
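The owner variable can be sketched as the share of owner households in all households. Treating tracts with zero households as missing is an assumption of this sketch, and the numbers are made up.

```r
# Toy household counts (made-up values).
hh <- data.frame(total_hh = c(1000, 400, 0),
                 owner_hh = c(560, 0, 0))

# Percentage of households that own their house.
# Assumption: tracts without households get NA instead of a division by zero.
hh$owner <- ifelse(hh$total_hh > 0, 100 * hh$owner_hh / hh$total_hh, NA)
print(hh)
```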

The percentage of home ownership is roughly linear until 70%, after which it drops. It is interesting to see that there are more tracts where all households are renters (0% owner) than tracts where everybody is an owner. This makes sense, as it is less likely that everyone in a particular tract owns their house, while especially in cities it is not as unlikely that everybody rents.

Household size

The median and mean household size is 3 persons (rounded). Data is not available for 30 tracts.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.020   2.530   2.970   3.037   3.490   9.750      30

When we look at the histogram, it seems that very few tracts (61) have an average household size of more than 5 members.

## [1] 61
##   owner_hhsize    renter_hhsize  
##  Min.   : 0.000   Min.   :0.000  
##  1st Qu.: 2.530   1st Qu.:2.460  
##  Median : 2.950   Median :3.020  
##  Mean   : 3.042   Mean   :3.088  
##  3rd Qu.: 3.490   3rd Qu.:3.660  
##  Max.   :11.910   Max.   :8.840  
##  NA's   :30       NA's   :30

Renter households are slightly larger than owner households, but the range is larger for owner households.

When looking at the distributions for both renter (blue) and owner (green) household sizes, we see that the owner distribution is taller and peaks around 3 people, while for renters there is a less distinct peak. There are more renter households with 4 or 5 people, which is somewhat surprising, as you would expect people who own their house to be older, financially able to buy a house and perhaps have live-in children, compared to the generally younger renters. However, with the current housing situation in parts of California (like the San Francisco Bay Area) this general rule might not hold, as more people are sharing housing with non-family members.

The peak at 0 occurs when there are really zero renter or owner households in that tract, or when the number is so low that the variable was not computed by the Census Bureau.

Median income

The median income shows which income was at the midpoint of a frequency distribution. For California there are quite a few high outliers, which pull the mean up. The median and mean of the median income are 60,530 and 67,334 dollars respectively, with 75% of the tracts having a median income of 84,704 dollars or less. For 53 tracts there is no median income data available.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    4541   42688   60530   67334   84704  250000      53

As the distribution is long-tailed, the x axis is transformed (log10) and we see a (somewhat whimsical) normal distribution. There is a sharp jump in the number of tracts with a median income of around 30,000 dollars.

Missing values

There are 53 missing values for the median income. I used the naniar library to visualize the other variables that are missing for these 53 tracts.

The plot shows the number of other variables (especially census variables) that are also missing. When scrolling through the data it became clear that the tracts that are missing a census variable are missing several of them. The missing voting variables seem to affect other tracts.
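The same check can be done numerically in base R, without the plot. The sketch below uses a small made-up data frame and counts missing values per variable, first overall and then only for the tracts that lack a median income.

```r
# Toy data with missing values (made-up numbers).
d <- data.frame(
  median_income      = c(50000, NA, NA, 70000),
  total_hhsize       = c(3, NA, 2.5, 2.8),
  participation_rate = c(60, 0, 100, 65)
)

# Number of missing values per variable.
na_per_variable <- colSums(is.na(d))
print(na_per_variable)

# Which other variables are missing for the tracts without a median income?
missing_income <- d[is.na(d$median_income), ]
na_in_subset <- colSums(is.na(missing_income))
print(na_in_subset)
```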

Of the variables that have no missing values the participation rate stands out as it is the only census variable that has no missing values (note that population variable already has been cleaned for zero values).

Participation rate

The participation rate is clustered around the mean and median, which both lie around 63-64%.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   59.20   64.10   63.38   68.83  100.00

The distribution looks normal. It is interesting that there are a number of tracts that have a participation rate of 0 or 100%. This is worth a closer look, as the participation rate did not have any missing values for the 53 tracts that were missing many other (census) variables.

0% & 100%

I noticed that almost all tracts that have a participation rate of 100% are special land tracts. The tract in San Diego is the airport, so this should probably also be a special_land tract.

I looked up some of the 0% tracts here and found that often only a prison or detention center was located in these tracts (regardless if they were marked special_land or not). I also noticed that when looking up some of these tracts, that their population is mainly male. I therefore added a variable for the male percentage of the population to this subset.

Only two tracts have a lower male percentage than expected. The tract with 50% is a special area and is a national park. The second lowest is a tract where only a VA medical center is located. All other tracts seem to be tracts with either prisons or detention centers.

The tracts for which the median income was missing seemed to overlap with the tracts that have a 0 or 100% participation rate as can be seen in the plot below that shows missing values for this subset.

I therefore decided to drop the tracts for which the median income is missing, which leaves us with a dataset of 7,959 tracts (+ 25 variables).

## [1] 7959   25
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.40   59.20   64.10   63.59   68.90  100.00

The statistics have not changed a lot, except that the minimum value is no longer 0, and the 100% participation rates only occur for special_land tracts in parks.

Unemployment rate

The unemployment rate ranges from 0 to 60.5%. The mean unemployment rate is 10.16% and the median 9.30%. There is a very long tail.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    6.60    9.30   10.16   12.80   60.50

When only taking into account values that lie in the 99% range, we still see a long tail, but most tracts have an unemployment rate of less than 16%.
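Restricting a plot to the lower 99% of the values can be sketched with a quantile cutoff. The rates below are invented for illustration; the single value of 60.5 stands in for an extreme outlier.

```r
# Toy unemployment rates: 100 regular values plus one extreme outlier.
unemployment_rate <- c(seq(2, 21.8, by = 0.2), 60.5)

# Keep only the values up to the 99th percentile.
cutoff <- quantile(unemployment_rate, 0.99, na.rm = TRUE)
trimmed <- unemployment_rate[unemployment_rate <= cutoff]

print(length(unemployment_rate))
print(length(trimmed))
```

The values are only dropped from the plot, not from the dataset, so the summary statistics still describe all tracts.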

Poverty rate

The poverty rate ranges from 0 to 91.8%. The mean is 16.49% and the median is 13.30%. Also here there is a long tail.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    7.30   13.30   16.49   23.20   91.80

Even when looking at only 95% of the values there is still a long tail. Most tracts have a poverty rate between 5 and 10%. The long tail is not surprising, as higher poverty rates are less likely.

Broadband data

Download speed

Note that broadband data had to be summarized, the median was calculated per tract as the distribution on block level was very long tailed. The median advertised download speed ranges between 2 and 40 Mbps. There is one missing value.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2.00   15.00   15.00   15.94   15.00   40.00       1

The 1st quartile, median and 3rd quartile are all 15. This is also clear from the histogram. There are also peaks at 12, 25 and 30 Mbps.

Upload speed

Median advertised upload speed ranges from 1 to 20 Mbps. The median and mean are 2 and 2.8 Mbps respectively.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   2.000   2.000   2.822   3.000  20.000       1

There are only 384 (4.8%) tracts that have a median upload speed higher than 3 Mbps.

## [1] 384

Number of providers

Overall, tracts have on average ~7 providers (also the median).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   4.000   6.000   7.000   7.232   8.000  13.000       1

50% of the tracts have between 6 and 8 providers (IQR).

Voting data

Voting participation

It seemed interesting to look at the voting participation, calculated as the total number of votes cast divided by the population per tract.

Variable added:

  • voting_part (total votes divided by population)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##   0.3523  27.6676  38.4196  39.3654  50.1880 340.5895       31

The median and mean are ~ 39%, meaning that 39% of the population per tract voted. Data for 31 tracts is missing.
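The calculation can be sketched as follows, with made-up numbers. The second toy tract deliberately has more votes than inhabitants, to show how implausible values can be flagged.

```r
# Toy tracts (made-up values); the third tract has no voting data.
v <- data.frame(total_vote = c(1500, 5200, NA),
                population = c(4000, 4000, 3000))

# Voting participation as a percentage of the population.
v$voting_part <- 100 * v$total_vote / v$population

# Flag tracts with more votes than inhabitants, ignoring missing values.
suspicious <- !is.na(v$voting_part) & v$voting_part > 100
print(v)
print(sum(suspicious))
```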

As the population also includes non-eligible voters (children etc.), we do not expect to see values of 100% or more. 11 tracts have a voting participation of more than 100%, but more tracts might be affected. I can think of two reasons why this is happening:

  1. Votes might have been assigned incorrectly when mapping from precincts to blocks. I have gone over this a couple of times and I cannot find where it goes wrong in my mapping, so I am not sure if there is something in the original mapping files or my method.
  2. Population and total votes number come from different datasets, so maybe numbers are assigned to wrong tracts.
## [1] 11
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3523 27.6467 38.3857 39.1951 50.1410 90.0239

Removing these 11 tracts does not change the median and mean much.

The distribution looks relatively normal, with not a clear peak though. 75% of the tracts have a voting participation of 50% or less.

Missing values

Data for 31 tracts is missing.

Plotting these tracts on a map shows that the whole county of Imperial is missing (the white tract was already dropped before) plus one tract in LA. Checking back with the database it notes that the data is not yet available for these tracts.

All census and broadband variables are available for these tracts, so I am not dropping them. When looking at voting data these tracts will not be included.

Democrats vs Republicans

The added variable ‘winner’ shows which party got the most votes. In almost 6,500 tracts the Democratic party got the most votes. That is 80% of the tracts. (Tracts without a winner are in Imperial & LA, see above.)

## [1] 80.34929
##       dem             rep       
##  Min.   :13.38   Min.   : 0.00  
##  1st Qu.:49.15   1st Qu.:15.36  
##  Median :64.34   Median :25.83  
##  Mean   :62.13   Mean   :28.59  
##  3rd Qu.:76.00   3rd Qu.:40.82  
##  Max.   :92.70   Max.   :81.90  
##  NA's   :31      NA's   :31

For all tracts the median and mean percentage votes for Democrats are 64.34 and 62.13% respectively. The median and mean are 25.83 and 28.59% for the Republicans. The Democratic party is more popular than the Republican party in the Californian tracts.

The histograms show the distribution for the tracts where each party was the largest. Please note that a party does not need 50% of the votes to be the largest, as multiple parties participated. It is clear that when the Democrats won, they mostly did so with a higher percentage of the votes: on average 68% vs 54% for the Republicans. The Republican distribution peaks just before 50%, while the Democratic one only peaks at 70% and is flatter. The maximum percentage of votes is also about 10 percentage points lower for the Republicans.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   42.78   58.31   69.37   68.38   78.30   92.70
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   41.33   48.46   52.83   54.18   58.53   81.90

Summary

Structure

Altogether 45 tracts (for having no population) plus 53 tracts (for not having a median income value) have been dropped, which leaves us with data for 7,959 tracts (+ 26 variables).

## [1] 7959   26

The dropped tracts are shown on the map. As can be seen, it is only a couple of tracts per county, except for LA, but LA has many tracts. We also know that these dropped tracts are either uninhabited or are water or special areas (national parks, business parks or jails/detention centers). Besides their special character, the multiple missing variables for these tracts also justify leaving them out of the exploration.

There are 26 variables:

  • Tract info - 4 (Id, county, county_name, special)
  • Census variables - 14 (population, total_hh, owner_hh, renter_hh, total_hhsize, owner_hhsize, renter_hhsize, owner, median_income, participation_rate, unemployment_rate, poverty_rate, square_miles, pop_sqmiles)
  • Broadband variables - 3 (median_down, median_up, nr_providers)
  • Voting variables - 5 (total_vote, dem, rep, winner, voting_part)

When plotting the tracts on a map, 6 additional variables (including latitude and longitude coordinates) are needed.

Interesting variables

I am mostly interested in how population per square mile interacts with the other variables, and also in how the median income and the winner (Democrats/Republicans) relate to other variables. I am also interested in whether certain tracts have less or more access to broadband.

The first exploration has shown that the census tracts in California are very diverse in size, income and political color, but also in other social and economic variables. It will be interesting to see if there are differences between poor and rich tracts, Democratic and Republican tracts, and sparsely and densely populated tracts.

New variables

Five variables were created:

  • pop_sqmiles (population divided by square miles)
  • winner (categorical variable showing party with most votes)
  • special (specify tract as water_only, special_land or regular)
  • voting_part (total votes divided by population)
  • owner (% of households that own their housing)

Bivariate Analysis

I decided not to look at variables that are absolute numbers (like population, household size, the total number of votes and square miles), as the tracts are so diverse. It seems more logical to look at these variables adjusted for size (like owner, population per square mile and voting participation).

The correlation matrix below shows that some variables correlate with other variables (shown in dark red or blue), but others clearly do not correlate at all (faint blue or red). Blue shows a positive correlation, red a negative one.
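A correlation matrix of this kind can be computed with base R as sketched below. The data is made up, and using pairwise complete observations is an assumption of this sketch, chosen so that tracts with a missing value in one variable still count for the other pairs.

```r
# Toy data with one missing value (made-up numbers).
d <- data.frame(
  median_income = c(40, 60, 80, 100, NA),
  poverty_rate  = c(30, 20, 15, 5, 10),
  owner         = c(35, 50, 60, 75, 55)
)

# Pearson correlations, using all complete pairs for each pair of variables.
cm <- cor(d, use = "pairwise.complete.obs")
print(round(cm, 2))
```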

Broadband variables

Interesting is that the broadband variables (median_down, median_up and nr_providers) show no meaningful correlation at all with the other variables in the dataset, except among themselves. Up and download speeds are not even very strongly correlated (0.59).

I was expecting that higher broadband speeds would be available in more populated and maybe higher-income tracts. This might have been influenced by the fact that the median per tract had to be calculated.

Population per square mile

We see that the areas with a high population density are around the big cities (300,000+ inhabitants, from the maps library): San Francisco, Oakland and San Jose in the Bay Area, Sacramento in Northern California, Fresno and Bakersfield in the Valley, Anaheim, Santa Ana, LA and Riverside in the larger LA area, and San Diego in the south.

The cities are indicated as points, so they can only be mapped to one tract, while multiple tracts can make up a city. I therefore looked for another way to visualize this.

Vs type

I looked for data marking rural and urban areas. The Census Bureau identifies two types of urban areas:

  • urbanized areas, 50,000 or more people
  • urban clusters, at least 2,500 and less than 50,000 people

Rural areas are areas that are not included in one of the two urban types.

I again used the tigris library to find the urbanized areas and urban clusters. I used this tutorial to find a way to know if urbanized areas or urban clusters are located in a tract.

Plotted we see the same pattern, in dark blue you see the urbanized areas which are indeed where we expect them to be but are more extensive as whole tracts are now marked. Same is true for the urban clusters. Few tracts in the north and east of the state have no urban cluster or urbanized areas.

I added this variable as a second categorical variable so that tracts can be compared on the type of tract.

Variable added:

  • type (urbanized areas, urban clusters, rural)

As expected, the median population per square mile is much higher in urbanized areas than in urban clusters and rural tracts, caused by smaller tracts with bigger populations. Only 95% of the values are shown, as there are many outliers for the urbanized area tracts.

Vs median income

There is a weak negative correlation (-0.26) between the median income and population per square mile. As both variables are long-tailed, both are plotted on a log10 scale. The relation seems to be positive at low densities, but where the data becomes more dense (most tracts have a pop_sqmiles between 5,000 and 10,000) it turns negative. Incomes under 30,000 dollars are more likely to be found in highly populated tracts.

Median incomes are higher in urbanized areas (U) than in the urban clusters and rural tracts. The median incomes in the last two are pretty similar.

Vs winner

If we look at the population per square mile, it is clear that in general the Democratic tracts are more densely populated than the Republican tracts. Note that only 95% of the values are shown due to high outliers for the Democrat tracts.

Vs owner

There is a negative correlation between population per square mile and ownership (-0.55), meaning that the more people live in a tract per square mile, the lower the share of households that own their home. This pattern becomes stronger from 5,000 people per square mile. This makes sense, as we would expect home ownership to be lower in more densely populated areas, especially as housing prices in parts of California are the highest in the country and owning a house is not a possibility for a large part of the population.

And indeed if we look at the three different types, median home ownership is highest in rural areas, slightly above 70% and lowest in urbanized areas at slightly above 55%.

Vs poverty rate

The correlation between the poverty rate and population per square mile is weak (0.31). It seems that up to 1,000 people per square mile there is no relation, but it becomes positive after 5,000 people, although the data is still very dispersed there.

The median poverty rate is highest in tracts that are marked as urban clusters (17.4%) and lowest in urbanized areas (12.9%). It is worth noting that the range as well as the number of outliers is larger in tracts marked as urbanized areas.

## # A tibble: 3 x 2
##              type median_poverty
##            <fctr>          <dbl>
## 1 urbanized areas           12.9
## 2  urban clusters           17.4
## 3           rural           15.4
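A grouped median like the one in the table above can be sketched in base R with aggregate. The rates below are invented and only show the mechanics; they do not reproduce the table.

```r
# Toy tracts with a type and a poverty rate (made-up values).
d <- data.frame(
  type = c("urbanized areas", "urbanized areas",
           "urban clusters", "urban clusters",
           "rural", "rural", "rural"),
  poverty_rate = c(10, 14, 16, 20, 12, 15, 30),
  stringsAsFactors = FALSE
)

# Median poverty rate per type of tract.
med <- aggregate(poverty_rate ~ type, data = d, FUN = median)
names(med)[2] <- "median_poverty"
print(med)
```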

Vs unemployment rate

Unemployment rates are highest in rural areas (11.8%), but only slightly higher than in urban clusters (11.6%). For 75% of the urbanized areas tracts the rate is under 12.5%, but most top outliers are also in this group.

## # A tibble: 3 x 2
##              type median_poverty
##            <fctr>          <dbl>
## 1 urbanized areas            9.1
## 2  urban clusters           11.6
## 3           rural           11.8

Combining both results it seems that unemployment is less correlated with poverty in rural areas than in urban clusters.

Vs voting participation

Note that only tracts with a voting participation of 100% or less are taken into account.

There is a weak negative correlation between the number of people per square mile and voting participation (-0.29). If a tract has more people per square mile, the voting participation is lower.

Voting participation increases with less urbanization, from 37.5% in urbanized areas to 47.5% in rural areas.

Vs % party votes

There is a negative relationship (-0.51) between the population per square mile and the percentage of the votes for the Republican party. The relationship becomes stronger when the density is higher, probably because the most dense areas vote Democratic.

It is pretty clear that the Republican party's median percentage of the votes is roughly halved in urbanized areas. The opposite is true for the Democratic party. The Republican party is also more popular in urban clusters.

Vs household size

Households seem to be the largest in the urbanized areas. These areas also have the most outliers (households with more than 5 members).

Median income

The median income is higher in the tracts along the coast and in highly populated areas. The San Francisco Bay Area, Monterey, Santa Barbara and greater LA see the highest median incomes. As shown before, median income is higher in urbanized areas.

Vs voting participation

The voting participation (total votes divided by the population) per tract is reasonably strongly positively correlated (0.59) with the median income. If the median income in a tract is higher, a bigger part of the population voted as well.

Vs winner

The median income for tracts where the Republican party was the largest is higher than in tracts where the Democratic party was the largest, but not by much. The tracts where the Democratic party was the largest have more outliers.

Vs owner

The percentage of households that own their house correlates reasonably strongly and positively (0.61) with the median income.

If a higher percentage of the households in a tract own their homes, the median income is also higher. As you generally need more income to be able to buy a house, it is not a surprise that these variables move together.

Vs unemployment

Unemployment rate in a tract and the median income are negatively correlated (-0.53). Expected, as higher unemployment rates correlate with lower median incomes.

Vs poverty

As one can expect there is also a strong negative correlation (-0.73) between the poverty rate and the median income in a tract. The poverty and unemployment rate are also positively correlated (0.57), which is also not surprising as these variables move hand in hand.

Winner

The Democratic party was the largest in 80% of the tracts. However, if we plot this on the map, California looks mostly red (Republicans).

Vs square miles

Most of the tracts where the Democrats won are much smaller than the tracts where the Republicans won. The distribution for the Democrats is very long-tailed though; 75% of these tracts are smaller than 1.12 square miles. Only 95% of the values are shown in order to make the boxplot for the Democrats readable. As we have seen before, the population per square mile shows the opposite: Democrats won in more densely populated areas.

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    0.0218    0.3420    0.5876    5.7765    1.1207 1599.8826

Vs poverty & unemployment rates

The median poverty rate is higher for Democratic tracts (14.4%) than for Republican tracts (10%). The median unemployment rate is however pretty similar for both at 9.2-9.3%.

winner       median_poverty  median_unemployment
Democrats              14.4                  9.3
Republicans            10.0                  9.2

Vs voting participation

Voting participation in Democratic tracts (~35%) is lower than in Republican tracts (~46%). There are more outliers for tracts where the Republicans won, both on the top and the bottom.

Vs household size

The household size in Democrat tracts seems to be slightly higher. The plot also shows that the tracts with larger households (above 5) are where the Democrats won.

Vs owner

Ownership is higher (above 70%) in tracts where Republicans won. Ownership in Democratic tracts is much lower (just above 50%). The Democratic distribution, however, seems almost perfectly symmetrical. There are quite a few outliers for the Republican tracts.

Voting participation

Voting participation correlates with many variables in this dataset. Note that we only plot tracts where voting participation was below 100%. For 11 tracts the voting participation is higher which seems to be due to incorrect data. Therefore I am bit reluctant about the results for this variable as the higher correlations found might be due to data errors.

Voting participation is particularly low in the Central Valley. I wonder what causes this. Little interest in politics is one possibility, but it might also be caused by big families or by people who are not allowed to vote, both of which increase the population but not the number of votes.

Vs household size

Voting participation is clearly negatively correlated with household size (-0.6): the more people live in an average household, the smaller the percentage of the tract that voted. This is probably because bigger households add to the population but not in the same way to the total number of votes, as children are not allowed to vote. There is not much difference in the correlation with household size between renters and owners (-0.50 and -0.51 respectively).

##               vtng_ rntr_ ownr_
## voting_part    1.00            
## renter_hhsize -0.50  1.00      
## owner_hhsize  -0.51  0.63  1.00
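A lower-triangle correlation table of this kind can be built with base R. This is a sketch on simulated data, so the numbers will not match the output above; the column names are taken from that output, and the package actually used for the report is not known to me.

```r
# Simulated data with a negative link between household size and voting.
set.seed(1)
n <- 200
owner_hhsize <- rnorm(n, mean = 3, sd = 0.6)
renter_hhsize <- 0.6 * owner_hhsize + rnorm(n, mean = 1.2, sd = 0.4)
voting_part <- 70 - 8 * owner_hhsize + rnorm(n, sd = 6)
sim <- data.frame(voting_part, renter_hhsize, owner_hhsize)

# Pearson correlations, rounded to two decimals.
cor_mat <- round(cor(sim, use = "pairwise.complete.obs"), 2)

# Blank out the upper triangle so each pair is shown once.
cor_mat[upper.tri(cor_mat)] <- NA
print(cor_mat, na.print = "")
```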

Vs owner

It seems that if a larger percentage of the households in a tract own their house, a bigger percentage has also voted (0.48). The distribution is, however, very wide, even though an upward trend is visible.

Vs poverty rate

There is a relatively strong negative correlation (-0.62) between voter participation and the poverty rate in a tract. When the poverty rate is lower, more people cast their vote. The distribution looks a bit concave.

Vs unemployment rate

There is a negative correlation between the unemployment rate and the voting participation rate (-0.42). When unemployment is higher in a tract a smaller percentage voted. There are some tracts that have very high unemployment rates which I do not show in the plot.

Winner vs type

I thought it would be interesting to split tracts on both the winner and the type variable. Used this tutorial. It is very clear that tracts marked as urbanized areas are more likely to have voted Democratic, and that rural tracts are more likely to have voted Republican. For urban clusters there is a roughly 50:50 divide between the two winners.
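The split on winner and type comes down to a cross-tabulation with row percentages. A minimal sketch with invented tracts, assuming the columns are called `type` and `winner`:

```r
# Toy tracts: 6 urbanized, 4 urban cluster and 4 rural tracts (made up).
tracts <- data.frame(
  type = c(rep("urbanized areas", 6), rep("urban clusters", 4), rep("rural", 4)),
  winner = c(rep("Democrats", 5), "Republicans",
             rep(c("Democrats", "Republicans"), 2),
             "Democrats", rep("Republicans", 3))
)

# Count tracts per type and winner.
tab <- table(tracts$type, tracts$winner)

# Share of each winner within a type (rows sum to 100).
row_share <- round(100 * prop.table(tab, margin = 1), 1)
print(row_share)
```

Using `margin = 1` gives the share of winners within each type; `margin = 2` would instead give the share of types within each winner.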

Summary

Main variables

The population per square mile variable does not correlate strongly with any of the other variables in the dataset. It shows moderate correlations with the owner (-0.55) and party variables (0.52/-0.51).

The median income correlates moderately with the unemployment rate (-0.53), owner (0.61) and voting participation (0.59). It correlates more strongly with the poverty rate (-0.73), which is expected.

The winner variable shows that Democratic tracts are smaller, are more densely populated, have lower median incomes, lower ownership rates, higher poverty rates, similar unemployment rates and lower voting participation, but slightly larger households than Republican tracts. As there are more Democratic tracts, the range of values is more diverse and there are multiple outliers.

The newly introduced type variable shows that urbanized areas have the highest median incomes, the lowest ownership, voting participation, poverty and unemployment rates, the largest households and the lowest % of votes for the Republican party. All variables become steadily larger or smaller when moving on to the urban cluster and rural tracts, except for the poverty rate, which is the highest in urban clusters.

Urbanized tracts are most likely tracts where the Democratic party got the most votes. In rural tracts the Republican party was more likely to have won. For urban clusters it was roughly half and half.

Other variables

Voting participation is the variable that correlates with the most other variables, although only moderately: positively with ownership (0.48) and the median income (0.59), negatively with the poverty rate (-0.62) and household size (-0.6). The total votes and population per tract that are used to calculate this variable come from different datasets, so there might be errors, as was already indicated by the fact that some tracts had a voting participation higher than 100%. When these tracts are removed, however, the correlation with the other variables even increases by 2 to 3 percentage points.

Strongest correlation

The strongest correlation is between the dem and rep variables (0.99), but as these are percentages that together make up most of the votes, this is not surprising. The second and third largest correlations are between the total household size and the renter and owner household size variables (0.84 and 0.88 respectively). This is no surprise either, as the total household size is based on the other two. The fourth largest correlation is between the median income and the poverty rate variables (-0.73). These variables are also expected to relate to each other. No real surprises here.

Multivariate Plots Section

The median poverty rate in urban clusters was higher (17.4%) than in the other two area types. This was unique, as for all other variables the median would rise or fall when moving from the most densely to the least densely populated tracts.

type             median_poverty  median_income
urbanized areas            12.9        62302.0
urban clusters             17.4        47182.0
rural                      15.4        46386.5

I wanted to see if the poverty rates in the urban clusters were correlated with median income and if there was any difference between Republican and Democratic tracts.

winner       number_of_tracts
Democrats                 232
Republicans               274

There are 506 tracts that are marked as urban clusters (excluding tracts that did not have voter results). For a little over half (54%) of these tracts the Republican party was the largest.
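The total and the shares follow directly from the counts in the table. A small check in R, where only the variable names are my own:

```r
# Number of urban cluster tracts per winner, as reported in the table.
counts <- c(Democrats = 232, Republicans = 274)

# Total number of urban cluster tracts with voter results.
total <- sum(counts)

# Share of tracts per winner, in percent.
shares <- round(100 * counts / total, 1)
print(total)
print(shares)
```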

On the map

I mapped the poverty rates and the winner on the tracts that are marked as urban clusters.

As some tracts are small it is hard to see if most votes went to the Democratic or Republican party and the color of the dot indicating the poverty rate. However there are a couple of things that catch the eye.

It seems that the tracts are grouped with other tracts where the same party won. Coastal tracts and tracts in the valley voted mostly Democratic. In the bigger tracts the Republicans mostly won. It is not very clear, but the tracts where the poverty rate was the highest (40% and above) seem to be ones where the Democratic party got the most votes. Low poverty rates are mostly found in Republican tracts.

In a scatterplot

As the results were not very clear due to the size of some tracts, I opted for another approach. I decided to look at the correlation between the poverty rate and median income in the different types of tracts in scatterplots.

The correlation for the whole dataset was -0.73. If we look at the subsets based on type (level of urbanization), the correlation is the strongest for the urban clusters (-0.75). We also see that almost all tracts with a poverty rate above 40% in the urban clusters voted Democratic.

If we split the data up in 6 groups based on type and winner, we see that the correlation is highest (-0.797) for urban clusters where the Democratic party got the most votes.
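Computing the correlation per group can be sketched with `split()` and `sapply()`. The data below is simulated, so the coefficients differ from the ones reported; the column names `type`, `winner`, `median_income` and `poverty_rate` are assumed to match the dataset.

```r
# Simulated tracts with a negative income/poverty relation (toy values).
set.seed(42)
n <- 300
sim <- data.frame(
  type = sample(c("urbanized areas", "urban clusters", "rural"), n, replace = TRUE),
  winner = sample(c("Democrats", "Republicans"), n, replace = TRUE),
  median_income = runif(n, 20000, 120000)
)
sim$poverty_rate <- pmax(0, 40 - sim$median_income / 3500 + rnorm(n, sd = 5))

# One data frame per combination of type and winner (6 groups).
groups <- split(sim, list(sim$type, sim$winner), drop = TRUE)

# Pearson correlation between income and poverty within each group.
group_cor <- sapply(groups, function(g) cor(g$median_income, g$poverty_rate))
print(round(group_cor, 2))
```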

It is also visible that there are more tracts with an income above 100,000 dollars in urbanized areas and that in urbanized areas and urban clusters lower median incomes are more likely in Democratic tracts. The highest poverty rates are more likely to be in Democratic tracts. Except for rural areas where the distribution looks pretty similar which is also reflected in the correlation coefficient.

If we split the urban cluster distribution per winning party, it shows even more clearly that for Democratic tracts the poverty rate and median income are more highly correlated (-0.80). The Democratic distribution is narrower and less erratic than the Republican one.

We also see that the median poverty rate for Democratic tracts is higher (21.1% vs 15.9%). The median of the median income is higher in Republican tracts (49,562.5 vs 44,212 dollars). So a median Democratic tract has a higher poverty rate and a lower median income.

Mann-Whitney U test

If we look at the distribution of the poverty rate in these histograms, we see that the distributions are not normal, especially the Democratic one has multiple peaks.

As the distributions are not normal, we can use the Mann–Whitney U test to test whether the two distributions indeed differ. This is a nonparametric test. Under the null hypothesis it is equally likely that the poverty rate for a randomly selected Democratic tract is greater or smaller than the poverty rate for a randomly selected Republican tract.

H0: D1 and D2 are identical

Ha: D1 is shifted to the right of D2

D1 is the probability distribution for Democratic tracts, D2 is the probability distribution for Republican tracts.

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  poverty_rate by winner
## W = 39728, p-value = 1.251e-06
## alternative hypothesis: true location shift is not equal to 0

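The distributions in the two groups differ significantly. We can therefore conclude that poverty rates in Democratic tracts exceed the poverty rates in Republican tracts for this dataset (Mann–Whitney U = 39,728, n_Democrats = 232, n_Republicans = 274, P < 0.01). Note that the output above reports the two-sided alternative. As W lies above n_Democrats × n_Republicans / 2 = 31,784, the shift is in the direction stated in Ha, so the one-tailed P-value is smaller still.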
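To test the one-sided hypothesis Ha directly, `wilcox.test()` accepts an `alternative` argument. Below is a sketch on simulated poverty rates with the same group sizes; the gamma distributions are an arbitrary choice for skewed toy data, so W and the P-value will not match the output above.

```r
# Simulated skewed poverty rates, 232 Democratic and 274 Republican tracts.
set.seed(7)
sim <- data.frame(
  winner = factor(rep(c("Democrats", "Republicans"), times = c(232, 274))),
  poverty_rate = c(rgamma(232, shape = 4, scale = 5.5),
                   rgamma(274, shape = 4, scale = 4.2))
)

# One-sided test: is the first factor level (Democrats) shifted to the right?
res <- wilcox.test(poverty_rate ~ winner, data = sim, alternative = "greater")
print(res)

# Under H0 the expected value of W is n1 * n2 / 2.
expected_w <- 232 * 274 / 2
print(expected_w)
```

With a formula, the test compares the first factor level against the second, so the order of the levels of `winner` decides what "greater" means.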

Summary

The median poverty rate was highest in urban clusters (17.4%). When looking at the correlation between the median income and the poverty rate, it became clear that the correlation was also the strongest in these tracts (-0.75). When we split up the urban cluster tracts by which party got the most votes, Democratic tracts showed an even stronger correlation (-0.80 vs -0.71), a much higher median poverty rate (21.1% vs 15.9%) and a lower median income (44,212 vs 49,563 dollars) than Republican tracts. So it seems that the higher median poverty rate for urban clusters was mostly caused by Democratic tracts with high poverty rates.

Final Plots and Summary

Plot One

California is always described as a Democratic state. If you look at this map, however, it shows a different picture. Of course the map is misleading, as we have seen that tracts where the Republicans won are bigger and less densely populated. As a result the Democratic party is still the biggest in California, but the map makes clear that there is not one California and that there are many areas where the Republican party is very present.

Plot Two

This overview of scatterplots shows the relations between the median income and the poverty rate. What catches the eye is the strong relation between the variables for Democratic urban clusters.

The poverty rates are lower in the Republican tracts, as we can see from the tracts being more clustered at the lower rates. The same is true for the median income: Republican tracts are more clustered in the higher income ranges. The maximum poverty rates are found in the Democratic tracts (except for the rural tracts). This overview also shows very clearly how many tracts are marked as urbanized areas and how many of these tracts voted Democratic.

Plot Three

Unfortunately this map did not come out as nice as it could be as there are so many tracts and especially the small ones make it hard to add an additional variable.

That said, it is still interesting to see that there are clusters of tracts that voted either Republican or Democratic. The Democratic cluster in the middle of the state is interesting. I am wondering whether, as this is an agricultural area, it therefore also sees high poverty rates. Almost all black dots (poverty rate of 50% and above) are found in Democratic tracts.

Reflection

Challenges

The first challenge was to find all the different data and combine them into one dataset. It was surprising that broadband and census data were available per census tract. I thought it would be easy to combine this with the voting data. Unfortunately that proved to be a bit more difficult, as voting results were available per precinct instead of per census tract. It took a while to figure out how to combine the different voting datasets to get the voting results per census tract.

After all this time-consuming work it seems that something went wrong, as voting participation was too high in some tracts. Probably votes have been assigned incorrectly when mapping from precincts to blocks. I have gone over this a couple of times and I cannot find where it goes wrong in my mapping, so I am not sure whether the problem is in the original mapping files (there was also an issue with votes in the middle of water tracts) or in my method. It can also be caused by the fact that the population data came from a different dataset.

I was also struggling to find relations. I had hoped for a relation between population density and broadband speed, for one, but there was nothing there. Only the categorical variables (winner and type) made for some interesting findings. However, both were created for this analysis, so they might not be too reliable.

Successes

I am new to R, so plotting the results on maps was a bit of a challenge, but with insights from different tutorials (marked in the main text) I figured out how to customize this for my use. To be honest I still do not totally understand what is happening exactly, but it works!

Future research

It would be interesting to explore further whether something is wrong with the voting data and, for example, to look further into why the voting participation is so much higher in rural areas. Comparing total votes with the population over 18 might give a cleaner rate.
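Such a cleaner rate could look like the sketch below. The column `pop_18plus` is hypothetical (it is not in the current dataset) and the values are made up to show how a tract with many children looks worse on the population-based rate.

```r
# Two toy tracts with the same population but a different share of adults.
tracts <- data.frame(
  total_votes = c(1200, 900),
  population = c(3000, 3000),
  pop_18plus = c(2400, 1800)  # hypothetical column: residents aged 18 and over
)

# Rate as used in this report: votes relative to the whole population.
tracts$voting_part <- 100 * tracts$total_votes / tracts$population

# Alternative rate: votes relative to the population aged 18 and over.
tracts$voting_part_18plus <- 100 * tracts$total_votes / tracts$pop_18plus
print(tracts)
```

In this toy example the gap between the two tracts disappears once children are left out of the denominator.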

Sources

Next to the links in the text, I have also made use of: